The objective of this third assignment is to use the ranked file from A2 and perform non-thresholded analysis with GSEA (Version 4.0.3) (Efron and Tibshirani 2019). Then the result is compared to the result from thresholded analysis in A2. Finally, cytoscape (Version 3.7.2, Java 1.8.0_162 by Oracle Corporation)(Lopes et al. 2010) is used to help to create a visualization of the up and down regulated genes along with annotations from the Bader lab. Post analysis is performed with drugBank datasets from Bader lab as well. Other details such as performing the analysis of cytoscape can be retrieved from my Journal
My data is from GSE84054 (Goh et al. 2017)
What I have changed in A1: I decided to leave the genes that have duplicated names because Ruth sugguested that they are a very small number compare to my total number of genes, ~1%. So I deleted many steps in A1, and made the cleaning step clearer by showing boxplots, density plots and MDS plots in each cleaning step.
I proved in A1 that the 17 unmapped gene ids are actually duplicates. The reason why the previous check did not catch them was because if we change the “R” to “0”, they can be detected. (They are exactly same, except locating on the Y chromosome). So here I am just going to remove them.
Loaded my normalized data from A1.
We need to explore the data again after normalization to ensure the normalized data reaches our expectations. 1. boxplot - normalized
Based on the two models in part1, I decide to base my model only on “cell_type”
There are 6988 genes that pass the p-value = 0.05 which is chosen based on the paper.
I have shown that my data is suitable for using edgeR for further analysis. The data follows the binomial distribution.
The individual dots represent each gene and the blue line is the overall trend line.
I used Quasi-likelihood models to fit my data and used QLFTest to test for differential expression. The Quasi-likelihood compares two conditions (primary tumour and tumoursphere) and shows the up and down-regulated genes. The result below that are sorted by p-value. I also inspected the number of genes that satisty my threshold and correction. I choose to use FDR correction based on the paper as well(Goh et al. 2017) . There are 7467 genes pass the p-value = 0.05, and 5033 genes that pass the FDR correction.
I determined the number of up-regulated genes by selecting every gene that does not pass my p-value: 0.05, and also have a positive log fold change. Down-regulated genes are selected in the same way with a negative log fold change. Stored these data for later enrichment analysis on gProfileR.
I have shown the up and down-regulated genes in a volcano plot by coloring them in red and blue, the code is from (Annick Moisan, n.d.)
To test the differential expression, I used the heatmap and it has shown a clear distinction between up and down regulated genes. There is a clear difference between the primary tumour samples and tumoursphere samples.(They are reversed.) The clustering is very obvious to show that differential expression exists.
There were no REAC found. I used the gprofiler2’s function to query data and also attached the screenshots that I took on their website since the package does not show the number of gene sets each has found.
The some analysis is apply to down regulated
R Libraries used:
gmt_url = "http://download.baderlab.org/EM_Genesets/current_release/Human/symbol/"
# list all the files on the server
filenames = getURL(gmt_url)
tc = textConnection(filenames)
contents = readLines(tc)
close(tc)
# get the gmt that has all the pathways and does not include terms inferred from
# electronic annotations(IEA) start with gmt file that has pathways only
rx = gregexpr("(?<=<a href=\")(.*.GOBP_AllPathways_no_GO_iea.*.)(.gmt)(?=\">)", contents,
perl = TRUE)
gmt_file = unlist(regmatches(contents, rx))
download.file(paste(gmt_url, gmt_file, sep = ""), destfile = dest_gmt_file)
# compute ranks
qlf_output_hits_withgn[,"rank"] <- log(qlf_output_hits_withgn$PValue, base = 10) * sign(qlf_output_hits_withgn$logFC)
# sort table by ranks
qlf_output_hits_withgn <- qlf_output_hits_withgn[order(qlf_output_hits_withgn$rank),]
# write gene name and rank to table
write.table(x=qlf_output_hits_withgn[, c(2, ncol(qlf_output_hits_withgn))],
file="expr_RNAseq_ranks.rnk",sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
kable(head(qlf_output_hits_withgn), caption = "Expr_RNAseq_ranks", format="html") %>%
kable_styling(bootstrap_options = "hover")
| GENEID | SYMBOL | logFC | logCPM | F | PValue | FDR | rank | |
|---|---|---|---|---|---|---|---|---|
| 6712 | ENSG00000138061 | CYP1B1 | 4.702632 | 8.7663234 | 69.53909 | 0e+00 | 1.10e-06 | -8.279169 |
| 7041 | ENSG00000140465 | CYP1A1 | 15.269415 | 6.7840800 | 103.52965 | 0e+00 | 2.30e-06 | -7.783528 |
| 3203 | ENSG00000108448 | TRIM16L | 2.649821 | 5.1277576 | 51.14183 | 1e-07 | 9.50e-06 | -7.006173 |
| 12935 | ENSG00000184292 | TACSTD2 | 2.713284 | 8.1525339 | 50.03904 | 1e-07 | 1.07e-05 | -6.920426 |
| 5676 | ENSG00000131174 | COX7B | 2.145485 | 8.2378456 | 49.36129 | 1e-07 | 1.16e-05 | -6.867103 |
| 16977 | ENSG00000235899 | RP11-345L23.1 | 6.388144 | 0.1678515 | 49.21533 | 1e-07 | 1.18e-05 | -6.855557 |
I used the computed ranked set of genes to compute the non-thresholded gene set entichment analysis with GSEA_4.0.3.(Efron and Tibshirani 2019) I loaded the rank file “expr_RNAseq_ranks.rnk” and Bader Lab gene set “human_GOBP_AllPathways_no_GO_iea_March_01_2020_symbol.gmt”. The parameters are permutation = 100, no collapse, max = 200, min = 15. The file obtained is: “A3_part1_non_thres.GseaPreranked.1584469398976”. Here is a screenshot before performing the analysis on GSEA.
My data is divided into “POS” and “NEG” which correspond to up-regulatedand down-regulated genes.
| NAME | SIZE | ES | NES | NOM.p.val | FDR.q.val | FWER.p.val | RANK.AT.MAX | LEADING.EDGE |
|---|---|---|---|---|---|---|---|---|
| TGF_BETA_RECEPTOR%IOB%TGF_BETA_RECEPTOR | 185 | -0.2044902 | NA | NA | 1 | 0 | 1836 | tags=15%, list=9%, signal=16% |
| HALLMARK_ADIPOGENESIS%MSIGDB_C2%HALLMARK_ADIPOGENESIS | 193 | -0.3041408 | NA | NA | 1 | 0 | 2169 | tags=23%, list=11%, signal=26% |
| HALLMARK_OXIDATIVE_PHOSPHORYLATION%MSIGDB_C2%HALLMARK_OXIDATIVE_PHOSPHORYLATION | 197 | -0.6852279 | NA | NA | 1 | 0 | 2021 | tags=39%, list=10%, signal=43% |
| HALLMARK_P53_PATHWAY%MSIGDB_C2%HALLMARK_P53_PATHWAY | 195 | -0.2911640 | NA | NA | 1 | 0 | 2716 | tags=24%, list=14%, signal=28% |
| TRANSPORT TO THE GOLGI AND SUBSEQUENT MODIFICATION%REACTOME DATABASE ID RELEASE 71%948021 | 155 | -0.1743973 | NA | NA | 1 | 0 | 3435 | tags=24%, list=17%, signal=29% |
| MITOTIC ANAPHASE%REACTOME DATABASE ID RELEASE 71%68882 | 173 | -0.4131416 | NA | NA | 1 | 0 | 3058 | tags=32%, list=15%, signal=37% |
| NAME | SIZE | ES | NES | NOM.p.val | FDR.q.val | FWER.p.val | RANK.AT.MAX | LEADING.EDGE |
|---|---|---|---|---|---|---|---|---|
| TGF_BETA_RECEPTOR%IOB%TGF_BETA_RECEPTOR | 185 | -0.2044902 | NA | NA | 1 | 0 | 1836 | tags=15%, list=9%, signal=16% |
| HALLMARK_ADIPOGENESIS%MSIGDB_C2%HALLMARK_ADIPOGENESIS | 193 | -0.3041408 | NA | NA | 1 | 0 | 2169 | tags=23%, list=11%, signal=26% |
| HALLMARK_OXIDATIVE_PHOSPHORYLATION%MSIGDB_C2%HALLMARK_OXIDATIVE_PHOSPHORYLATION | 197 | -0.6852279 | NA | NA | 1 | 0 | 2021 | tags=39%, list=10%, signal=43% |
| HALLMARK_P53_PATHWAY%MSIGDB_C2%HALLMARK_P53_PATHWAY | 195 | -0.2911640 | NA | NA | 1 | 0 | 2716 | tags=24%, list=14%, signal=28% |
| TRANSPORT TO THE GOLGI AND SUBSEQUENT MODIFICATION%REACTOME DATABASE ID RELEASE 71%948021 | 155 | -0.1743973 | NA | NA | 1 | 0 | 3435 | tags=24%, list=17%, signal=29% |
| MITOTIC ANAPHASE%REACTOME DATABASE ID RELEASE 71%68882 | 173 | -0.4131416 | NA | NA | 1 | 0 | 3058 | tags=32%, list=15%, signal=37% |
I used g:profiler in A2 to conduct thresholded analysis. The result showed that both the up-regulated and the down-regulated genes have metabolic terms associated. Whereas using the non-thresholded methods, immune response seems to be associated with up-regulated genes. Beta-receptor seems to be associated with the down-regulated group which is the same as the result from g:profiler. Both thresholded and non-thresholded over-representation analysis shows that the disease is associated with immune response and metabolic malfunctions. Such result is consistent with the result from the paper, and I have found some of the evidences from the other papers in A2, showing that breast cancer is associated with these top terms.
I defined the main biological themes by using clicking on “auto annotate” -> “annotate”. The system created circles around each cluster which corresponds to the most frequent node lavles in the cluster. Here is a screenshot of the overview.
The reason why I choose to do the post analysis with drugs is because the paper mentioned about finding a target for a drug. The drug is Pacritinib. However, this drug is not annotated, and thus not stored in the Bader Lab file. Therefore, I chose the top drug (Abciximab) (Law et al. 2014) from Bader lab approved drugs file. The paper also mentioned that they use Gemcitabine to treat breast cancer if the disease does not expand further.
Parameters: Mann Whiteney(One-sided greater): since I do not care whether it targets the up-regulated or down-regulated genes.
The resulting graph shows that there are no connection for Gemcitabine with any pathways, and a lot of up-regulated genes are targeted by Abciximab. The terms that are targeted are mostly the up-regulated genes, such as “response immune immunoglobin”, to reduce the elevated level of the antibody. It also reduces the level of other cell metabolic terms that are in the same pathway as the “response immune immunoglobin” (discussed in the section above). The main function of this drug is to reduce antibodies, and other over-produced molecules, but not elevating the weak terms (the down-regulated genes).
The only thing not expected is that Gemcitabine does not target the term “cell proliferation positive” which is what the main object of Gemcitabine. Gemcitabine is a drug that kills the fast-growing cells. Therefore, it is not expected to see that this drug does nothing in this disease pathway.
It might be hard to see the Gemcitabine, it is right beside Abciximab,.
Annick Moisan, Nathalie Villa-Vialaneix, Ignacio Gonzales. n.d. “Practical Statistical Analysis of Rna-Seq Data - edgeR - Tomato Data.”
Efron, Brad, and R. Tibshirani. 2019. GSA: Gene Set Analysis. https://CRAN.R-project.org/package=GSA.
Goh, Jian Yuan, Min Feng, Wenyu Wang, Gokce Oguz, Siti Maryam JM Yatim, Puay Leng Lee, Yi Bao, et al. 2017. “Chromosome 1q21.3 Amplification Is a Trackable Biomarker and Actionable Target for Breast Cancer Recurrence.” Nature Medicine 23 (11). Nature Publishing Group: 1319.
Gu, Zuguang, Roland Eils, and Matthias Schlesner. 2016. “Complex Heatmaps Reveal Patterns and Correlations in Multidimensional Genomic Data.” Bioinformatics.
Gu, Zuguang, Lei Gu, Roland Eils, Matthias Schlesner, and Benedikt Brors. 2014. “Circlize Implements and Enhances Circular Visualization in R.” Bioinformatics 30 (19): 2811–2.
Kolberg, Liis, and Uku Raudvere. 2019. Gprofiler2: Interface to the ’G:Profiler’ Toolset. https://CRAN.R-project.org/package=gprofiler2.
Law, Vivian, Craig Knox, Yannick Djoumbou, Tim Jewison, An Chi Guo, Yifeng Liu, Adam Maciejewski, et al. 2014. “DrugBank 4.0: Shedding New Light on Drug Metabolism.” Nucleic Acids Research 42 (D1). Oxford University Press: D1091–D1097.
Lopes, Christian T, Max Franz, Farzana Kazi, Sylva L Donaldson, Quaid Morris, and Gary D Bader. 2010. “Cytoscape Web: An Interactive Web-Based Network Browser.” Bioinformatics 26 (18). Oxford University Press: 2347–8.
Qiu, Xiaoyan, Xiaohui Zhu, Liang Zhang, Yuntao Mao, Jian Zhang, Peng Hao, Guohui Li, et al. 2003. “Human Epithelial Cancers Secrete Immunoglobulin G with Unidentified Specificity to Promote Growth and Survival of Tumor Cells.” Cancer Research 63 (19). AACR: 6488–95.
Rainer, Johannes. 2017. EnsDb.Hsapiens.v75: Ensembl Based Annotation Package.
Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.
Temple Lang, Duncan. 2020. RCurl: General Network (Http/Ftp/...) Client Interface for R. https://CRAN.R-project.org/package=RCurl.
Vasiliou, Vasilis, and Daniel W Nebert. 2005. “Analysis and Update of the Human Aldehyde Dehydrogenase (Aldh) Gene Family.” Human Genomics 2 (2). BioMed Central: 138.
Vassalli, Giuseppe. 2019. “Aldehyde Dehydrogenases: Not Just Markers, but Functional Regulators of Stem Cells.” Stem Cells International 2019. Hindawi.
Xie, Yihui. 2020. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.org/knitr/.
Zhu, Hao. 2019. KableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.